Training a Genre Classifier for Automatic Classification of Web Pages
Similar resources
Training a Genre Classifier for Automatic Classification of Web Pages
This paper presents experiments on classifying web pages by genre. Firstly, a corpus of 1,539 manually labeled web pages was prepared. Secondly, 502 genre features were selected based on the literature and the observation of the corpus. Thirdly, these features were extracted from the corpus to obtain a data set. Finally, two machine learning algorithms, one for induction of decision trees (J48) a...
Automatic Genre Classification in Web Pages Applied to Web Comments
Automatic Web comment detection could significantly facilitate information retrieval systems, e.g., a focused Web crawler. In this paper, we propose a text genre classifier for Web text segments as an intermediate step for Web comment detection in Web pages. Different feature types and classifiers are analyzed for this purpose. We compare the two-level approach to state-of-the-art techniques operat...
Genre Classification of Web Pages
Genre classification means to discriminate between documents by means of their form, their style, or their targeted audience. Put another way, genre classification is orthogonal to a classification based on the documents’ contents. While most of the existing investigations of an automated genre classification are based on news articles corpora, the idea here is applied to arbitrary Web pages. W...
Some Issues in Automatic Genre Classification of Web Pages
In this paper, two experiments in automatic genre classification of web pages are presented. These two experiments are designed to highlight three important issues related to genre classification: corpus composition and genre palettes, feature representativeness, and exportability of classification models. Results show the influence of corpus composition and genre palette on classification rate...
Journal
Journal title: Journal of Computing and Information Technology
Year: 2007
ISSN: 1330-1136,1846-3908
DOI: 10.2498/cit.1001137